
[air] Horovod: Use Torch.encode_data if torch is imported #28440

Merged

merged 6 commits into ray-project:master from train/horovod-encode-torch on Sep 13, 2022

Conversation

krfricke (Contributor) opened this pull request:

Signed-off-by: Kai Fricke <[email protected]>

Why are these changes needed?

Horovod with Tune does not work out of the box for GPU checkpoints, as they get deserialized on the non-GPU trainer worker, leading to errors. With this PR, we detect whether torch is imported and a tensor is supplied in the Horovod backend. If so, we use the torch backend to serialize the data.
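A minimal sketch of the idea, not the actual Ray Train internals: `encode_data` and the `_serialized_torch_payload` key are illustrative stand-ins, while `contains_tensor` mirrors the helper added in this PR.

```python
import io
import sys


def contains_tensor(obj) -> bool:
    # Recursively look for a torch.Tensor anywhere in the payload.
    import torch  # only called after torch is known to be imported

    if isinstance(obj, torch.Tensor):
        return True
    if isinstance(obj, dict):
        return any(contains_tensor(k) or contains_tensor(v) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return any(contains_tensor(item) for item in obj)
    return False


def encode_data(data_dict: dict) -> dict:
    # Only take the torch path when torch is already imported: checking
    # sys.modules avoids pulling torch into torch-free environments.
    if "torch" in sys.modules and contains_tensor(data_dict):
        import torch

        # torch.save knows how to handle tensors; loading the blob with
        # torch.load(..., map_location="cpu") on the receiving worker
        # keeps GPU tensors off machines without GPUs.
        buffer = io.BytesIO()
        torch.save(data_dict, buffer)
        return {"_serialized_torch_payload": buffer.getvalue()}
    return data_dict
```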

Related issue number

Closes #28439

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

amogkam (Contributor) left a comment:

Thanks @krfricke, lgtm as a stopgap fix! But ultimately we should refactor the checkpoint encoding/decoding logic out of the Backends and into the framework-specific checkpoints.

Then, when saving a Torch model via TorchCheckpoint.from_model(), the same encode/decode logic will apply regardless of whether I'm using TorchTrainer or HorovodTrainer.

Made an issue to track this here: #28462
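A rough sketch of the proposed shape. The class and method names below (other than from_model, which the comment mentions) are illustrative stand-ins, not the actual ray.air API:

```python
import io

import torch


class TorchCheckpointSketch:
    """Stand-in for a framework-specific checkpoint that owns its own
    encode/decode logic, independent of which training backend made it."""

    def __init__(self, state_dict: dict):
        self._state_dict = state_dict

    @classmethod
    def from_model(cls, model: torch.nn.Module) -> "TorchCheckpointSketch":
        # Capture a CPU copy of the weights so the checkpoint is safe to
        # deserialize on workers without GPUs.
        state = {k: v.cpu() for k, v in model.state_dict().items()}
        return cls(state)

    def encode(self) -> bytes:
        buffer = io.BytesIO()
        torch.save(self._state_dict, buffer)
        return buffer.getvalue()

    @classmethod
    def decode(cls, blob: bytes) -> "TorchCheckpointSketch":
        # map_location="cpu" guarantees tensors land on CPU regardless
        # of the device they were saved from.
        state = torch.load(io.BytesIO(blob), map_location="cpu")
        return cls(state)
```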

@@ -190,3 +190,19 @@ def load_torch_model(
f"to be of type `torch.nn.Module`, or a model "
f"state dict of type dict."
)


def contains_tensor(obj):
amogkam (Contributor) commented on contains_tensor:

Do we need this? I think if torch is installed, it should be safe to always use the TorchBackend for encoding/decoding (even if the data dict does not contain a tensor). I'm worried that, in the worst case, contains_tensor can lead to a lot of recursion.

krfricke (Contributor, Author) replied:

I originally didn't have it in, but then figured the overhead wouldn't be too bad. But I agree: since this only concerns an internal communication channel, and the intermediate objects are not exposed to the user, we can just always do this when torch is loaded. Updated the PR.

krfricke (Contributor, Author) followed up:

Turns out we do need it, as torch.save seems to silently fail if a full model is passed (rather than a state dict).

I think it should be fine: a similar lookup has to happen during pickling anyway, and in most cases it should finish early.
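To illustrate the "finishes early" point, using the hypothetical contains_tensor sketch from the PR description above (not the exact PR code): a typical state dict hits a tensor at the very first value, so the recursion stops immediately, while the worst case is a deeply nested, tensor-free payload.

```python
import torch

# Common case: a state dict maps parameter names directly to tensors,
# so contains_tensor returns True on the first value it inspects.
state_dict = {
    "linear.weight": torch.zeros(4, 4),
    "linear.bias": torch.zeros(4),
}
assert contains_tensor(state_dict)

# Worst case: a deeply nested, tensor-free structure forces a full
# traversal -- the recursion concern raised above.
nested = {"a": [{"b": [0] * 1000}]}
assert not contains_tensor(nested)
```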

This reverts commit 2a21445.

Signed-off-by: Kai Fricke <[email protected]>
This reverts commit 93913af.

Signed-off-by: Kai Fricke <[email protected]>
@krfricke krfricke merged commit 3292ce8 into ray-project:master Sep 13, 2022
@krfricke krfricke deleted the train/horovod-encode-torch branch September 13, 2022 10:33
Development

Successfully merging this pull request may close these issues.

[train] Horovod+Torch does not convert GPU tensors to CPU